select(gapminder, country:pop, -continent)MET581 Lecture 03
Wrangling Data 1: dplyr
1 Tidy Data
There are 3 basic principles for tidy data:
- Each observation must have its own row
- Each variable must have its own column
- Each value must have its own cell
We should also keep to one type of observational unit per table
2 dplyr Introduction
There are 6 main verbs that you need to know to use dplyr effectively:
- select() → select variables by name
- filter() → return rows with matching conditions
- arrange() → arrange rows by variables
- mutate() → add new variables
- group_by() → return grouping of variables
- summarise() → reduces multiple values down to a single value
The important thing to remember is that all verbs follow the same format: - verb(dataset, options)
3 dplyr Notes and Exercises
3.1 Setup
Make sure you have all of the following installed and loaded to follow along:
- gapminder
- dplyr
- stringr
- nycflights13
In addition, you may want to install the following to explore in your own time:
- devtools (for installing packages from github)
- skimr (for fast and useful summaries of data)
3.2 dplyr::select()
There are lots of helper functions that work with select, e.g.:
- starts_with(“Hap”) -> ALL columns whose name starts with “Hap”
- ends_with(“ppy”) -> ALL columns whose name ends with “ppy”
- contains(“app”) -> ALL columns whose name contains “app”
- matches(“(SNP|CHR)”) -> ALL columns who name match a regular expression
- num_range(“x”, 1980:1983) -> ALL columns named x1980, x1981, x1982, and x1983
- one_of(char_vector) -> ALL columns who name appears in character vector (char_vector)
3.2.1 practice answers
Using gapminder
- Select the column range from country to population, but exclude continent
Using starwars
- load with
data("starwars") - Select name, height and all columns which end in the word ‘color’
select(starwars, height, ends_with('color'))Using flights
- load with library(nycflights13)
- Select all columns in the vector (‘year’, ‘month’, ‘day’, ‘carrier’, ‘flight’, ‘dest’)
cols_to_select <- c('year', 'month', 'day', 'carrier', 'flight', 'dest')
select(flights, one_of(cols_to_select))Note that we could have just used select(flights, cols_to_select), without including one_of(). However, if there was a column name in the vector cols_to_select that wasn’t present in our dataset, we would get an error, whereas the code above would still work correctly but give a warning. Stick to using one_of() where possible.
3.2.2 extra practice answers
Using gapminder
- Select country, year, and population columns from gapminder
# either of the following options are an acceptable answer
select(gapminder, country, year, pop)select(gapminder, -continent, -lifeExp, -gdpPercap)Using starwars
- Select only the columns stored as characters (tip: use
select_if())
# notice that you use is.character instead of is.character() i.e. the brackets are omitted
# also notice that we are using is.character, which tests to see if each column is of the
# type character. Using as.character() would fail, becuase this is used for converting to
# character, not testing for it
select_if(starwars, is.character)Using flights
- Select all columns ending in ‘time’, but exclude those starting with ‘sched’
# the important thing to note here is that we can combine more complex operations
# in a single select statement
select(flights, ends_with('time'), -starts_with('sched'))3.3 dplyr::filter()
Multiple arguments to filter can be combined using commas. This is equivalent to using &, which is the AND operator. It requires rows to meet all the conditions you give, and is a stringent filter. By contrast, using |, the OR operator, matches rows that meet one or more of your conditions. This is a much looser filter. You need to be careful that you’re using the operator you want, and don’t forget that a comma means AND, not OR.
3.3.1 practice answers
Using gapminder
- Keep only rows where continent is Americas or Europe
# you can either write out each filter, separated by the OR operator, or combine
# the two using %in%
# %in% is commonly used. It allows you to check if something, like a string, is
# present in a vector
# do NOT use == with a vector - it will give incorrect results
filter(gapminder, continent == "Americas" | continent == "Europe")filter(gapminder, continent %in% c("Americas", "Europe"))- Keep only rows where country is Canada or Australia and the year is before 1974
filter(gapminder, country %in% c('Canada', 'Australia'), year < 1974L)Using starwars
- Keep only the rows where name contains ‘light’ or eye_color contains ‘blue’ using
filter()andgrepl()
filter(starwars, grepl('light', name) | grepl('blue', eye_color))3.3.2 extra practice answers
Using gapminder
- Keep only the rows with life expectancy less than 35
filter(gapminder, lifeExp < 35)Using starwars
- Remove rows with brown hair colour
# the second option will exclude those with 'brown, blonde',
# whereas the first option will not
# removes everyone with exclusively brown hair
filter(starwars, hair_color != 'brown')# removes everyone containing the word 'brown' in their hair colour
filter(starwars, !grepl('brown', hair_color)) - Keep females with brown or blue eyes
# matching females with exclusively blue or exclusively brown eyes
filter(starwars, gender == 'female', eye_color %in% c('blue', 'brown'))Warning in `[<-.data.frame`(`*tmp*`, is_list, value = list(`12` = "<>", :
replacement element 1 has 1 row to replace 0 rows
Warning in `[<-.data.frame`(`*tmp*`, is_list, value = list(`12` = "<>", :
replacement element 2 has 1 row to replace 0 rows
Warning in `[<-.data.frame`(`*tmp*`, is_list, value = list(`12` = "<>", :
replacement element 3 has 1 row to replace 0 rows
# matching females with eyes that contain brown or blue
# e.g. 'brown, green' is matched too
filter(starwars, gender == 'female', grepl('brown|blue', eye_color))Warning in `[<-.data.frame`(`*tmp*`, is_list, value = list(`12` = "<>", :
replacement element 1 has 1 row to replace 0 rows
Warning in `[<-.data.frame`(`*tmp*`, is_list, value = list(`12` = "<>", :
replacement element 2 has 1 row to replace 0 rows
Warning in `[<-.data.frame`(`*tmp*`, is_list, value = list(`12` = "<>", :
replacement element 3 has 1 row to replace 0 rows
3.4 dplyr::arrange()
3.4.1 practice answers
Using gapminder
- arrange by country then continent
arrange(gapminder, country, continent)- arrange by country then descending year
arrange(gapminder, country, desc(year))3.5 dplyr::mutate()
3.5.1 practice answers
Using starwars
- Add a new column called ‘BMI’, created using height * mass
mutate(starwars, BMI = height*mass)- Dichotomise height into the strings ‘tall’ and ‘short’ and assign as factor to ‘height_dichot’ (tip: use
ifelse())
# an arbitrary cutoff for height was chosen in order to
# create the two categories
mutate(starwars, height_dichot = as.factor(ifelse(height < 160, 'Short', 'Tall')))3.5.2 extra practice answers
Using starwars
- Make the values in hair_color, skin_color, eye_color and gender Title Case (tip: use
str_to_title()fromstringr)
# Title Case Is Where Each Word Is Capitalised, As Done In This Sentence
# it's not useful for analysis, but can make our data look a bit neater
# when we present it to others
# here we are modifying the values in a column to be title case,
# which is what mutate does
# to instead modify the column names to be title case, use the rename()
# function from dplyr
# the long version is:
mutate(starwars, hair_color = str_to_title(hair_color),
eye_color = str_to_title(eye_color),
skin_color = str_to_title(skin_color),
gender = str_to_title(gender))# a cleaner version is:
starwars |>
mutate_at(vars(one_of(c('hair_color', 'skin_color',
'eye_color', 'gender'))),
funs(str_to_title))Using flights
- Assume ‘EWR’ is the code for unknown airport. Recode it to NA in columns ‘origin’ and ‘dest’ using
na_if()
mutate(flights, origin = na_if(origin, 'EWR'),
dest = na_if(dest, 'EWR'))3.6 Pipes
Pipes:
- pass the information forward to the next verb
- are a really useful way of expressing a series of operations
- allow us to quickly see what is being done
- mean we focus on the verbs, not the nouns
Native Pipes
- used to be in the magrittr package, then dplyr
- are now part of base R! (as of version 4.1.0)
- we can use them with
|>(no need to load tidyverse) - every time you see
|>here, you could use%>%instead (but load dplyr first) - you will still see
%>%in a lot of code in the wild, so it’s good to know both - you will need to use
%>%yourself if you’re forced to use an older version of R
3.6.1 practice answers
Using starwars
- Filter for all rows that don’t contain NAs and select columns that are characters, then create a new boolean column called ‘hair_eye_mismatch’, which is TRUE for anyone with exclusively brown hair and blue eyes
starwars |>
na.omit() |>
select_if(is.character) |>
mutate(hair_eye_mismatch = ifelse(hair_color == 'brown' & eye_color == 'blue', TRUE, FALSE))3.6.2 extra practice answers
Using flights - Show the carrier and flight number for flights with arrival delays greater than 10 and distances over 1000, sorted by descending distance
flights |>
filter(arr_delay > 10, distance > 1000) |>
arrange(desc(distance)) |>
select(carrier, flight)3.7 dplyr::group_by() and dplyr::summarise()
group_by() doesn’t do much on its own, but by combining group_by() and summarise() we get so much more power! We can start to show summary statistics broken down by different groups, like the mean life expectancy per continent, all after applying whatever filters we want.
# e.g.
gapminder |>
select(-pop) |>
filter(continent == 'Oceania', year > 1980L) |>
arrange(desc(gdpPercap)) |>
group_by(country) |>
summarise(n_years = n(),
Mean_Life_Exp = mean(lifeExp),
SD_Life_Exp = sd(lifeExp),
Max_GDP = max(gdpPercap))3.7.1 practice
Using gapminder
- Get the median and standard deviation for gdpPercap
summarise(gapminder, mean_GDP = mean(gdpPercap),
sd_GDP = sd(gdpPercap))Using airquality - load with data(“airquality”)
- Convert to tibble, remove Month and Day, keep Temperatures above 60, then summarise Solar.R with number, mean, median and standard deviation
# the airquality dataset is built-in to R, and can be loaded with data('airquality')
data("airquality")
airquality |>
as_tibble() |>
select(-Month, -Day) |>
filter(Temp > 60L) |>
summarise(n_obvs = n(),
Mean_solar = mean(Solar.R, na.rm = TRUE),
Median_solar = median(Solar.R, na.rm = TRUE),
SD_solar = sd(Solar.R, na.rm = TRUE))3.7.2 extra practice
Using airquality
- Summarise the mean values for integer columns only using
summarise_if()andis.integer
airquality |>
as_tibble() |>
summarise_if(is.integer, mean, na.rm=TRUE)Using starwars
- What is the mean height and weight of all Droids? Include a count of how many droids there are
starwars |>
filter(species == 'Droid') |>
summarise(N_droid = n(),
Mean_height = mean(height, na.rm = TRUE),
Mean_weight = mean(mass, na.rm = TRUE))- We want to compare Humans from different planets. Show the number of people and their mean height and mass, broken down by homeworld then gender. Make sure you show all rows when printing by piping the last line into
print(n = x), wherexis a suitably high number of rows
starwars |>
filter(species == 'Human') |>
group_by(homeworld, gender) |>
summarise(N_people = n(),
Mean_height = mean(height, na.rm = TRUE),
Mean_weight = mean(mass, na.rm = TRUE)) |>
print(n = 21)`summarise()` has grouped output by 'homeworld'. You can override using the
`.groups` argument.
# A tibble: 20 × 5
# Groups: homeworld [15]
homeworld gender N_people Mean_height Mean_weight
<chr> <chr> <int> <dbl> <dbl>
1 Alderaan feminine 1 150 49
2 Alderaan masculine 2 190. 79
3 Bespin masculine 1 175 79
4 Chandrila feminine 1 150 NaN
5 Concord Dawn masculine 1 183 79
6 Corellia masculine 2 175 78.5
7 Coruscant feminine 1 167 NaN
8 Coruscant masculine 1 170 NaN
9 Eriadu masculine 1 180 NaN
10 Haruun Kal masculine 1 188 84
11 Kamino masculine 1 183 78.2
12 Naboo feminine 2 175 45
13 Naboo masculine 3 179. 75
14 Serenno masculine 1 193 80
15 Socorro masculine 1 177 79
16 Stewjon masculine 1 182 77
17 Tatooine feminine 2 164 75
18 Tatooine masculine 6 184. 100.
19 <NA> feminine 2 NaN NaN
20 <NA> masculine 4 193 89
4 Closing notes
4.1 Other really useful verbs and tools
dplyr::glimpse()instead ofstr()- use
everything()to re-order columns- e.g.
select(starwars, new_column, everything())can be used to make new_column the first column
- e.g.
- use
n_distinct()instead oflength(unique())n_distinct()is commonly used insummarise()to see how many unique categories are in a column after filtering
dplyr::rename(new_name = old_name)for renaming columns- be aware that
dplyr::select_()and other variants exist for many of the main dplyr verbs dplyr::near()anddplyr::between()- integers can be compared with the equality operator ‘==’, e.g.
2 == 2 - however, doubles should never be compared this way because decimals are stored with finite precision
- this mean means that while
2 == 2is TRUE,sqrt(2) ^ 2 == 2evaluates to FALSE! - instead,
dplyr::near(sqrt(2) ^ 2, 2)should be used if you have to check for equality using doubles - we can combine it with filter, e.g.
filter(starwars, near(birth_year, 41.9)) dplyr::between()can be used to select ranges with filter, e.g.filter(starwars, between(birth_year, 30, 60))
- integers can be compared with the equality operator ‘==’, e.g.
coalesce(),recode()andcase_when()fromdplyrcoalesce()lets you replace NAs, e.g.mutate(starwars, gender = coalesce(gender, 'unknown'))recode()can be used to recode values in a tibble e.g.mutate(starwars, eye_color = recode(eye_color, 'brown' = 'maroon'))- this will change all instances of ‘brown’ in the eye_color column to be ‘maroon’ instead
case_when()is a good alternative to using nested ifelse() statements
tibble::rownames_to_column()simply takes the row names (if present) and makes them a column- e.g.
data(mtcars)has cars as row names, andtibble::rownames_to_column(mtcars, 'cars')makes this into a column instead
- e.g.
%<>%- a useful, but potentially dangerous, way of overwriting your original dataframe with your modified one
- it is generally not recommended, as code should be as explicit as possible while still being readable
- instead, it is usually best to assign your dataset using
new_df <- old_df |> select() |> etc()
4.2 Things to be aware of
- Packages can have functions with the same name - sometimes it helps to be specific e.g.
dplyr::select() - We can now used the conflicted package to be more explicit about this!
- See the introductory blog post from Hadley here
- Filtering using
grepl()is better done using stringr’sstr_detect()(introduced tomorrow) stringr(tomorrow) combined withdplyr(more tomorrow) andmaggritrwill cover a lot of your everyday needs
4.3 Should we always use dplyr?
What if:
- you have a single operation? or 20?
- a single operation doesn’t need the pipe as it is just a one-liner
- if you have lots of lines, it may be better to split up your code in the chunks of around ten lines
- this is because as the number of lines increases, so does the likelihood you will make a mistake!
- splitting it up at checkpoints allows you to check any important intermediate steps have worked
- the flow of operations isn’t linear?
- you might not always be doing operations sequentially, in which case the pipe is not appropriate
- you need to inspect an intermediate step?
- as mentioned above, it is often best to split up code to check intermediate steps, for your own sanity!
- speed is more important to you than readability?
- dplyr is fast, expressive and readable, but some other packages are faster
- the data.table package is faster for reading large datasets and manipulating them, though you might lose some readability
Base R should also not be ignored. Some simple operations are shorter/cleaner in base R, like subsetting a single dataframe column by index.
4.4 Homework
- Quarto file
- Regex practice before tomorrow
Suggested Reading
- R for Data Science 2e, chapter 4
- The docs and vignette at
browseVignettes(package = "dplyr")
4.5 The Tidyverse Life
If you really want to be proficient with the tidyverse packages, you need to practice. Pick a dataset on something you’re interested in and try to answer questions you have - google dataset search is your friend. That should be your priority. After you’ve practised a lot and you feel comfortable, you might want even more reading and learning. Here are some starters:
If you really can’t get enough tidyverse in your life:
- Tidyverse blog
- follow @hadleywickham, @posit_pbc, @tidyverse and #rstats